Library Imports
```python
from datetime import datetime

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
```
Template
```python
spark = (
    SparkSession.builder
    .master("local")
    .appName("Exploring Joins")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)

sc = spark.sparkContext
```
Initial Datasets
```python
pets = spark.createDataFrame(
    [
        (1, 1, 'Bear', 5),
        (2, 1, 'Chewie', 10),
        (3, 2, 'Roger', 15),
    ], ['id', 'breed_id', 'nickname', 'age']
)

pets.toPandas()
```
|   | id | breed_id | nickname | age |
|---|----|----------|----------|-----|
| 0 | 1  | 1        | Bear     | 5   |
| 1 | 2  | 1        | Chewie   | 10  |
| 2 | 3  | 2        | Roger    | 15  |
```python
groupby_columns = ['breed_id']
```
Option 1: Using a Dictionary
```python
df_1 = (
    pets
    .groupby(groupby_columns)
    .agg({
        "*": "count",
        "age": "sum",
    })
)

df_1.toPandas()
```
|   | breed_id | count(1) | sum(age) |
|---|----------|----------|----------|
| 0 | 1        | 2        | 15       |
| 1 | 2        | 1        | 15       |
What Happened:

- Very similar to the `pandas` `agg` function.
- The resultant column names are a bit awkward to use after the fact (see the sketch below).
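To make that awkwardness concrete, here is a minimal sketch (a hypothetical follow-up step, not part of the original notebook) that reuses the `df_1` above; the auto-generated names have to be spelled out exactly, parentheses and all:

```python
# Hypothetical follow-up: keep breeds that appear more than once.
# The auto-generated name "count(1)" must be quoted verbatim.
breeds_with_multiple_pets = (
    df_1
    .filter(df_1["count(1)"] > 1)
    .withColumnRenamed("sum(age)", "total_age_of_breeds")  # renaming after the fact
)

breeds_with_multiple_pets.toPandas()
```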
Option 2: Using List of Columns
```python
df_2 = (
    pets
    .groupby(groupby_columns)
    .agg(
        F.count("*"),
        F.sum("age"),
    )
)

df_2.toPandas()
```
|   | breed_id | count(1) | sum(age) |
|---|----------|----------|----------|
| 0 | 1        | 2        | 15       |
| 1 | 2        | 1        | 15       |
What Happened:

- Here we use the Spark `agg` functions.
- Again, the resultant column names are a bit awkward to use after the fact (the sketch below checks the generated names).
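As a quick check (a small sketch assuming the `df_2` above), the generated names can be inspected directly; `toDF` can rename them all positionally afterwards, but the names then live far from the aggregates that produce them:

```python
# Sketch: inspect the auto-generated column names from df_2.
print(df_2.columns)  # ['breed_id', 'count(1)', 'sum(age)']

# One after-the-fact fix: rename every column positionally.
df_2_renamed = df_2.toDF("breed_id", "count_of_breeds", "total_age_of_breeds")
print(df_2_renamed.columns)  # ['breed_id', 'count_of_breeds', 'total_age_of_breeds']
```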
Option 3: Using List of Columns, with Aliases
```python
df_3 = (
    pets
    .groupby(groupby_columns)
    .agg(
        F.count("*").alias("count_of_breeds"),
        F.sum("age").alias("total_age_of_breeds"),
    )
)

df_3.toPandas()
```
|   | breed_id | count_of_breeds | total_age_of_breeds |
|---|----------|-----------------|---------------------|
| 0 | 1        | 2               | 15                  |
| 1 | 2        | 1               | 15                  |
What Happened:

- Here we use the Spark `agg` functions and `alias`ed the resultant columns to new names.
- This provides cleaner column names that we can use later on (as in the sketch below).
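Because the aliases are ordinary column names, later steps can refer to them directly. A small sketch (a hypothetical follow-up, not part of the original) using the `df_3` above:

```python
# Sketch: the aliased names read naturally in downstream transformations.
avg_age_per_breed = (
    df_3
    .withColumn(
        "avg_age_of_breeds",
        F.col("total_age_of_breeds") / F.col("count_of_breeds"),
    )
    .orderBy(F.col("avg_age_of_breeds").desc())
)

avg_age_per_breed.toPandas()
```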
Summary
I encourage using option #3.

This creates more elegant and meaningful names for the new aggregate columns. A `withColumnRenamed` can be performed after the aggregates, but why not do it with an `alias`? It's easier as well.
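For comparison, here is a rough side-by-side of the two approaches (a sketch built on the `pets` DataFrame above); both end up with the same column names, but the `alias` version declares each name right next to the aggregate that creates it:

```python
# Rename after aggregating: the names are declared far from the aggregates.
df_renamed = (
    pets
    .groupby(groupby_columns)
    .agg(F.count("*"), F.sum("age"))
    .withColumnRenamed("count(1)", "count_of_breeds")
    .withColumnRenamed("sum(age)", "total_age_of_breeds")
)

# Alias inline (option #3): each name sits next to its aggregate.
df_aliased = (
    pets
    .groupby(groupby_columns)
    .agg(
        F.count("*").alias("count_of_breeds"),
        F.sum("age").alias("total_age_of_breeds"),
    )
)

assert df_renamed.columns == df_aliased.columns
```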